When discussing linear model based learning in Chapters 8 - 11 we employed the generic linear model
\begin{equation} \text{model}\left(\mathbf{x},\mathbf{w}\right) = w_0 + x_1w_1 + \cdots + x_Nw_N \end{equation}

for both regression and classification.
There we saw that it is often helpful to first standard normalize each input feature, replacing $x_n$ with

\begin{equation} x_n \longleftarrow \frac{x_n - \mu_n}{\sigma_n} \end{equation}

where $\mu_n$ and $\sigma_n$ are the mean and standard deviation along the $n^{th}$ feature of the input, respectively.

Batch normalization applies this same idea to the internal features of a multilayer network: when the model takes the form of a multilayer perceptron, we standard normalize the distribution of each perceptron / activation output over the input data, where
\begin{equation}
\begin{aligned}
\mu_{f_b^{(1)}} &= \frac{1}{P}\sum_{p=1}^{P}f_b^{(1)}\left(\mathbf{x}_p \right) \\
\sigma_{f_b^{(1)}} &= \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(f_b^{(1)}\left(\mathbf{x}_p \right) - \mu_{f_b^{(1)}} \right)^2}.
\end{aligned}
\end{equation}

Python implementation of batch normalization

# fully evaluate our network features using the tensor of weights in w
def feature_transforms(a, w):
    # loop through each layer matrix
    for W in w:
        # pad with ones (to compactly take care of bias) for next layer computation
        o = np.ones((1,np.shape(a)[1]))
        a = np.vstack((o,a))

        # compute linear combination of current layer units
        a = np.dot(a.T, W).T

        # pass through activation
        a = activation(a)
    return a
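As a quick sanity check, the network evaluator above can be exercised on a small random example. The relu activation, the toy input, and the layer weight shapes below are assumptions made purely for illustration; this sketch repeats the function so it runs on its own:

```python
import numpy as np

# assumed activation for this sketch - the relu used throughout this section
activation = lambda t: np.maximum(0, t)

# fully evaluate our network features using the tensor of weights in w
def feature_transforms(a, w):
    for W in w:
        # pad with ones (to compactly take care of bias)
        o = np.ones((1, np.shape(a)[1]))
        a = np.vstack((o, a))
        # linear combination of current layer units, then activation
        a = np.dot(a.T, W).T
        a = activation(a)
    return a

# toy example: N = 2 input dimensions, P = 5 points,
# two layers of 2 units each (each weight matrix has shape (N_in + 1, N_out))
np.random.seed(0)
x = np.random.randn(2, 5)
w = [np.random.randn(3, 2), np.random.randn(3, 2)]

f = feature_transforms(x, w)
print(f.shape)   # (2, 5): one row of activation outputs per final-layer unit
```

Note the shape convention: features run along rows and data points along columns, which is why the bias pad is a row of ones.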
The standard_normalizer function below subtracts off the mean of its input along each of its dimensions and divides off the associated standard deviations (provided they are non-zero).

# standard normalization function
def standard_normalizer(x):
    # compute the mean and standard deviation of the input
    x_means = np.mean(x,axis = 1)[:,np.newaxis]
    x_stds = np.std(x,axis = 1)[:,np.newaxis]

    # check to make sure that x_stds > small threshold; for those that are not,
    # divide by 1 instead of the original standard deviation
    ind = np.argwhere(x_stds < 10**(-2))
    if len(ind) > 0:
        ind = [v[0] for v in ind]
        adjust = np.zeros((x_stds.shape))
        adjust[ind] = 1.0
        x_stds += adjust

    # create standard normalizer function
    normalizer = lambda data: (data - x_means)/x_stds

    # return normalizer
    return normalizer
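To see the zero-variance safeguard in action, here is a small usage sketch; the toy array is invented for illustration, with its second feature held constant so that its standard deviation falls below the threshold and the function divides by one instead:

```python
import numpy as np

# standard normalization function
def standard_normalizer(x):
    # compute the mean and standard deviation of the input
    x_means = np.mean(x, axis=1)[:, np.newaxis]
    x_stds = np.std(x, axis=1)[:, np.newaxis]

    # guard against (near) zero standard deviations: divide by 1 instead
    ind = np.argwhere(x_stds < 10**(-2))
    if len(ind) > 0:
        ind = [v[0] for v in ind]
        adjust = np.zeros((x_stds.shape))
        adjust[ind] = 1.0
        x_stds += adjust

    # create and return the normalizer function
    normalizer = lambda data: (data - x_means) / x_stds
    return normalizer

# toy input: feature 0 varies, feature 1 is constant
x = np.array([[1., 2., 3., 4., 5.],
              [10., 10., 10., 10., 10.]])
x_normed = standard_normalizer(x)(x)

print(np.mean(x_normed[0]))  # numerically zero
print(np.std(x_normed[0]))   # numerically one
print(x_normed[1])           # all zeros - the constant feature maps to 0
```

Without the safeguard, the constant second feature would trigger a division by zero.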
The feature_transforms_batch_normalized function below is identical to feature_transforms above, except that each layer's activation outputs are standard normalized before being passed on to the next layer.

# a multilayer perceptron network with activation output normalization;
# note the input w is a tensor of weights
def feature_transforms_batch_normalized(a, w):
    # loop through each layer matrix
    for W in w:
        # pad with ones (to compactly take care of bias) for next layer computation
        o = np.ones((1,np.shape(a)[1]))
        a = np.vstack((o,a))

        # compute linear combination of current layer units
        a = np.dot(a.T, W).T

        # pass through activation
        a = activation(a)

        # NEW - perform standard normalization to the activation outputs
        normalizer = standard_normalizer(a)
        a = normalizer(a)
    return a
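Putting the pieces together, a quick check confirms that every unit's output distribution is re-centered at zero after its layer's normalization. The relu activation, random input, and weight shapes here are assumptions for illustration, and the helper functions are repeated so the sketch runs on its own:

```python
import numpy as np

activation = lambda t: np.maximum(0, t)  # assumed relu activation

# standard normalization function (as defined above)
def standard_normalizer(x):
    x_means = np.mean(x, axis=1)[:, np.newaxis]
    x_stds = np.std(x, axis=1)[:, np.newaxis]
    ind = np.argwhere(x_stds < 10**(-2))
    if len(ind) > 0:
        ind = [v[0] for v in ind]
        adjust = np.zeros((x_stds.shape))
        adjust[ind] = 1.0
        x_stds += adjust
    return lambda data: (data - x_means) / x_stds

# multilayer perceptron with activation output normalization
def feature_transforms_batch_normalized(a, w):
    for W in w:
        o = np.ones((1, np.shape(a)[1]))
        a = np.vstack((o, a))
        a = np.dot(a.T, W).T
        a = activation(a)
        # normalize the activation outputs over the data
        a = standard_normalizer(a)(a)
    return a

# toy 4-layer network with two units per layer, P = 50 random points
np.random.seed(1)
x = np.random.randn(2, 50)
w = [np.random.randn(3, 2) for _ in range(4)]

f = feature_transforms_batch_normalized(x, w)
print(np.mean(f, axis=1))  # each unit's mean output is numerically zero
```

Because the normalization is re-computed from the current activations on every forward pass, each unit's output distribution stays centered regardless of how the weights move during gradient descent.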
In the single layer example above we saw how the distributions of the relu units $f^{(1)}_1$ and $f^{(1)}_2$ - the features entering the model's linear combination along with the weights $w_0$ and $w_1$ - change dramatically as the gradient descent algorithm progresses. In this example we illustrate the covariate shift of a standard $4$ layer multilayer perceptron with two units per layer, using the relu activation and the same dataset employed in the previous example.
Each layer's output distribution is shown in this panel: the outputs of the first layer $\left(f_1^{(1)},f_2^{(1)}\right)$ are colored cyan, the second layer $\left(f_1^{(2)},f_2^{(2)}\right)$ magenta, the third layer $\left(f_1^{(3)},f_2^{(3)}\right)$ lime green, and the fourth layer $\left(f_1^{(4)},f_2^{(4)}\right)$ orange. In analogy to the animation shown above for a single layer network, the horizontal and vertical coordinates of each point represent the activation output of a layer's first and second unit, respectively.